Smoothing Categorical Data

Authors

  • Arno Siebes
  • René Kersten
Abstract

Global models of a dataset reflect not only the large-scale structure of the data distribution but also its smaller-scale structure. Hence, to see the large-scale structure, one should somehow subtract the smaller-scale structure from the model. For some kinds of models, such as boosted classifiers, it is easy to identify the “important” components; for many other kinds of models this is far harder, if possible at all. In such cases one might try an implicit approach: simplify the data distribution without changing its large-scale structure. That is, one first smooths the local structure out of the dataset and then induces a new model from the smoothed dataset. This new model should reflect the large-scale structure of the original dataset. In this paper we propose such a smoothing method for categorical data and for one particular type of model, viz., code tables. Experiments show that our approach preserves the large-scale structure of a dataset well: the smoothed dataset is simpler, while the original and smoothed datasets share the same large-scale structure.


Similar resources

Local Smoothing with given Marginals

In models using categorical data one may use adjacency relations to justify smoothing to improve upon simple histogram approximations of the probabilities. This is particularly convenient for sparsely observed or rather peaked distributions. Moreover, in a few models, prior knowledge of a marginal distribution is available. We adapt local polynomial estimators to include this partial informatio...


Analyzing Longitudinal Data Using Gee-Smoothing Spline

This paper considers nonparametric regression to analyze longitudinal data. Some developments in nonparametric regression have been achieved for longitudinal or clustered categorical data. For exponential family distributions, Lin & Carroll [6] considered nonparametric regression for longitudinal data using GEE-Local Polynomial Kernel (LPK). They showed that in order to obtain an efficient estim...


Statistical Notions of Data Disclosure Avoidance and Their Relationship to Traditional Statistical Methodology: Data Swapping and Loglinear Models

For most data releases, especially those from censuses, the U.S. Bureau of the Census has either released data at high levels of aggregation or applied a data disclosure avoidance procedure, such as data swapping or cell suppression, before preparing micro-data or tables for release. In this paper, we present a general statistical characterization of the goal of a statistical agency in releasing ...


Bayesian inference for categorical data analysis

This article surveys Bayesian methods for categorical data analysis, with primary emphasis on contingency table analysis. Early innovations were proposed by Good (1953, 1956, 1965) for smoothing proportions in contingency tables and by Lindley (1964) for inference about odds ratios. These approaches primarily used conjugate beta and Dirichlet priors. Altham (1969, 1971) presented Bayesian analogs...


Prediction of Notes from Vocal Time Series Produced by Singing Voice

Aiming at optimal prediction of the correct note corresponding to a vocal time series, we trained a classification algorithm on parts of interpretations of Tochter Zion (Händel) and tested the algorithm on the remaining parts. As the classification algorithm we use a radial basis function support vector machine together with a “Hidden Markov” method as a dynamisation mechanism and some ...



Publication date: 2012